The global estimation problem of the drift function is considered 
for a large class of ergodic diffusion processes. The unknown drift $S(\cdot)$ 
is supposed to belong to a nonparametric class of smooth functions of 
order $k \ge 1$, but the value of $k$ is not known to the statistician. A fully 
data-driven procedure of estimating the drift function is proposed, 
using the estimated risk minimization method. The sharp adaptivity 
of this procedure is proven up to an optimal constant, when the 
quality of the estimation is measured by the integrated squared error 
weighted by the square of the invariant density. 

1. Introduction. 

1.1. The problem. In this paper we consider the statistical problem of 
estimating the drift function of a diffusion process X, given as the solution 
of the stochastic differential equation 

$$dX_t = S(X_t)\,dt + \sigma(X_t)\,dW_t, \qquad X_0 = \xi,\quad t \ge 0, \tag{1}$$

where $W$ is a standard Brownian motion and the initial value $\xi$ is a random 
variable independent of $W$. We assume that a continuous record of observations 
$X^T = (X_t, 0 \le t \le T)$ is available. The goal is to estimate the function 
$S(\cdot)$, which is commonly referred to as the drift function and is interpreted 
as the instantaneous mean of the process $X$. 
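
To experiment with model (1) numerically, one needs a discretized path. The following is a minimal Euler-Maruyama sketch and is not part of the original paper: the function name `simulate_diffusion`, the step count and the Ornstein-Uhlenbeck example $S(x) = -x$, $\sigma \equiv 1$ are illustrative assumptions, and a fine grid only approximates the continuous record $X^T$.

```python
import math
import random

def simulate_diffusion(drift, sigma, x0, T, n_steps, seed=0):
    """Euler-Maruyama approximation of dX_t = S(X_t) dt + sigma(X_t) dW_t on [0, T]."""
    rng = random.Random(seed)
    dt = T / n_steps
    x = x0
    path = [x0]
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over one step
        x = x + drift(x) * dt + sigma(x) * dw
        path.append(x)
    return path, dt

# Illustrative choice (an assumption): Ornstein-Uhlenbeck drift S(x) = -x, sigma = 1.
path, dt = simulate_diffusion(lambda x: -x, lambda x: 1.0, x0=0.0, T=200.0, n_steps=20000)
```

For this example the invariant law is Gaussian with mean 0 and variance 1/2, so the empirical mean and variance of `path` give a quick sanity check of the simulation.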

In our setup, the diffusion coefficient $\sigma^2$ is identifiable using the quadratic 
variation of the semi-martingale $X$. Therefore, the problem of its estimation 
is not interesting from the viewpoint of asymptotic statistics. In the sequel, 
we suppose that $\sigma^2(\cdot)$ is a known function satisfying some boundedness and 
smoothness conditions. 
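
The identifiability claim can be illustrated numerically: over a fine grid, the sum of squared increments of the path approximates $\int_0^T \sigma^2(X_t)\,dt$ whatever the drift is. The sketch below is only an illustration; the Euler-discretized path, the constant $\sigma = 0.7$ and the drift $S(x) = -x$ are assumptions chosen for the example, not taken from the paper.

```python
import math
import random

def realized_quadratic_variation(path):
    """Sum of squared increments of a discretely sampled path."""
    return sum((b - a) ** 2 for a, b in zip(path, path[1:]))

# Hypothetical example: Euler scheme for dX = -X dt + 0.7 dW on [0, 10].
rng = random.Random(1)
T, n_steps, sigma = 10.0, 20000, 0.7
dt = T / n_steps
x, path = 0.0, [0.0]
for _ in range(n_steps):
    x += -x * dt + sigma * rng.gauss(0.0, math.sqrt(dt))
    path.append(x)

# Should be close to sigma**2 = 0.49; the drift contributes only at order dt.
sigma2_hat = realized_quadratic_variation(path) / T
```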

During the past decade statistical inference for continuous time Markov 
processes has been widely developed due to its numerous applications, 
namely, in mathematical finance and econometrics. In fact, the diffusion 
processes and their extensions, such as jump-diffusions and the solutions of 
stochastic differential equations driven by Lévy processes, are often used to 
model the evolution of asset prices and derivative securities. 

Estimation problems based on both continuous time and discretely sampled 
observations have been considered in the statistical literature. The latter 
setting is more interesting from the point of view of financial econometrics 
(see [1, 25]), since in finance, even if the underlying process is time continuous, 
only its values at a finite number of points are available. 

However, the theoretical development of statistical inference with a conti- 
nuous time record of observations turns out to be technically simpler than 
inference based on discretely observed data. It permits one, therefore, to go 
further in the statistical analysis of the model and to answer some questions 
open up to now for discretely observed diffusions. 

Note also that the continuous-time model can be considered as the limit of 
a discrete-time model when the step of discretization goes to zero (see [30]). 
Therefore, if the available data is "dense enough" with respect to the obser- 
vation time, the asymptotic behavior of estimation procedures, in practice, 
may be close to the asymptotic behavior proven theoretically for continuous- 
time observations. Thus, the knowledge of the best estimator based on 
continuous-time data is of practical interest as well. 

The purpose of the present paper is to estimate the drift function globally, 
that is, at any point $x \in \mathbb{R}$. We consider the case of ergodic diffusions, which 
means that the Markov process $X$ admits an invariant measure. Let $f_S$ 
denote the density of this invariant measure with respect to the Lebesgue 
measure on $\mathbb{R}$ (cf. [18], Chapter 4, Section 18, for more details). To quantify 
the performance of an estimator $\bar S_T(\cdot) = \bar S_T(\cdot, X^T)$ of the drift $S(\cdot)$, we use 
the weighted $L^2$-risk 

$$R_T(\bar S_T, S) = \int_{\mathbb{R}} \mathbf{E}_S\big[(\bar S_T(x) - S(x))^2\big]\, f_S^2(x)\,dx, \tag{2}$$

where $\mathbf{E}_S$ is the expectation with respect to the law $\mathbf{P}_S$ of $X$ defined by (1). 
We call an estimating procedure adaptive if its realization does not require 
any a priori information on the estimated function. The only information 
that we may (and should) use is that contained in the observations. We call 
an estimating procedure minimax sharp adaptive, or simply sharp adaptive, 
if its minimax risk converges with the best possible rate to the best possible 
constant. 

The main focus of this paper is constructing an estimating procedure 
which is minimax sharp adaptive with respect to the risk (2), when $T \to \infty$. 
The estimator of the drift we propose enjoys the following properties: 

- It is fully data-driven; in particular, it does not depend on the smoothness 
of the estimated function. 

- In the case when the observed path is generated by a stationary diffu- 
sion, the estimator is minimax rate-optimal over the Sobolev balls on any 
compact interval. The quality of estimation is then measured by the mean 
integrated squared error (MISE). 

- Still in the case of an ergodic diffusion, if the risk function is defined 
by (2), the estimator is asymptotically sharp adaptive over a large scale 
of Sobolev balls: it attains not only the optimal rate of convergence, but 
also the best possible constant. Moreover, the accuracy of the adaptive 
procedure is asymptotically as good as the accuracy of the best possible 
nonadaptive procedure. 

1.2. Adaptive estimation. The first result concerning minimax sharp adap- 
tivity in nonparametric curve estimation is due to Efromovich and Pinsker [11]. 
It was extended by Golubev [19] and Golubev and Nussbaum [22] for non- 
parametric regression and by Efromovich [9] and Golubev [21] for density 
estimation from i.i.d. data. Similar results have been obtained in some other 
contexts as well (we refer to Chapter 7.4 of [10] for a comprehensive discus- 
sion), but they all deal with either independent or Gaussian observations. 
Thus, the main difference of our study is that the observations we have 
at our disposal are neither independent nor Gaussian. Moreover, as follows 
from heuristics presented in Section 4.1, our model exhibits heteroscedastic 
structure. 

1.3. Estimation for diffusions. For a complete review of parametric and 
nonparametric methods for diffusion processes, we refer to [14], [24] and [27]. 

There are a number of papers devoted to the estimation of the drift in 
the case when the parameters (such as smoothness, Lipschitz constant) de- 
scribing the nonparametric class are known and when continuous-time ob- 
servations are available. Banon [3] proved the consistency in probability 
of kernel-type estimators, and Pham [31] obtained the rate of convergence 
in the same setup. These results have been extended by van Zanten [35], 
Galtchouk and Pergamenshchikov [17] and Kutoyants [27]. Pinsker's constant 
in this problem is obtained in [6]. More recently, an approach making 
use of a random rate of convergence is developed in [8]. 

The adaptive estimation of the drift at a fixed point based on continuous- 
time observations has been studied by Spokoiny [34], who has applied the 
Lepskii method (see [28]) to the locally linear smoothers in order to con- 
struct an adaptive rate-optimal procedure. For discretely sampled diffusions, 
a rate-optimal adaptive procedure for estimating the drift function has been 
proposed by Hoffmann [23]. 

In the problem of estimation and validation of the model with discretely 
sampled high frequency observations, which is frequently used in mathematical 
finance, recent progress has been achieved by Fan and Zhang [15] and 
Aït-Sahalia and Mykland [2]. 

Note also that the diffusion processes can be regarded as continuous- 
time analogues of autoregressive processes. A theoretical result establishing 
the connection between these two models has been proven by Milstein and 
Nussbaum [29]. 

The paper is divided into five sections. The description of the adaptive 
procedure is given in Section 2. In Section 3 the assumptions on the model, as 
well as the main result describing the asymptotic behavior of the procedure, 
are formulated. Some comments related to this result and its proof are given 
in Sections 4 and 5, respectively. 

2. Construction of the adaptive estimator. The construction of the adaptive 
estimator proposed in this paper relies heavily on the papers [4] and [21]. 
It can be divided into three steps according to the following scheme. First, 
we present an estimator of the drift function involving two kernel-type func- 
tions and two bandwidths. This estimator is rate-optimal if the orders of 
the bandwidths are chosen in a correct way. Second, we derive an asymp- 
totically exact upper bound of the risk of this estimator. The minimization 
of the maximum of this risk bound over the Sobolev ball of smoothness k 
and radius R provides the explicit forms of the optimal kernel and the opti- 
mal bandwidth, depending on k and R. The last step is to substitute some 
"good" data-driven approximation for these parameters. 

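2.1. The nonadaptive estimator. Let $K(\cdot), Q(\cdot) \in L^2(\mathbb{R})$ be two positive 
$k$-times ($k \ge 1$) continuously differentiable symmetric functions such that 
$\int K = \int Q = 1$, and let $\alpha = \alpha_T$ and $\nu = \nu_T$ be two positive functions of $T$ 
decreasing to zero. According to the Girsanov formula, for two drift functions 
$S$ and $S_0$, the log-likelihood $\log\big(\frac{d\mathbf{P}_S}{d\mathbf{P}_{S_0}}(X^T)\big)$ in the model (1) is given by 

$$\Lambda_T(S, S_0, X^T) = \int_0^T \frac{S(X_t) - S_0(X_t)}{\sigma^2(X_t)}\,dX_t - \frac{1}{2}\int_0^T \frac{S^2(X_t) - S_0^2(X_t)}{\sigma^2(X_t)}\,dt.$$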

We have supposed in the above formula that the law of the initial value 
does not depend on $S$. A widely used idea for constructing nonparametric 
estimators is to find a local (around the state $x$) approximation $\Lambda(\theta, X^T, x)$ 
of $\Lambda_T(S, S_0, X^T)$ depending only on a finite-dimensional parameter $\theta$ and to 
define the estimator as the value of the parameter $\theta$ maximizing $\Lambda(\theta, X^T, x)$. 
We use the following "local constant" approximation of the log-likelihood: 

$$\Lambda(\theta, X^T, x) = \frac{\theta}{\alpha_T\sigma^2(x)}\int_0^T K\Big(\frac{x - X_t}{\alpha_T}\Big)\,dX_t - \frac{\theta^2}{2\nu_T\sigma^2(x)}\int_0^T Q\Big(\frac{x - X_t}{\nu_T}\Big)\,dt$$

(in this expression, the terms not depending on $\theta$ are dropped, since they 
have no influence on the definition of the MLE). It is evident that the maximum 
of this expression is attained by 

$$\hat\theta_T(x) = \frac{1}{\alpha_T}\int_0^T K\Big(\frac{x - X_t}{\alpha_T}\Big)\,dX_t \left(\frac{1}{\nu_T}\int_0^T Q\Big(\frac{x - X_t}{\nu_T}\Big)\,dt\right)^{-1}, \tag{3}$$

provided that the denominator is different from 0. A similar algorithm but 
with local linear smoothers is used in [34]. As explained in Section 4.4 of 
[6], for symmetric functions $K$ and $Q$, the asymptotic properties of the estimators 
defined via the local constant and the local linear smoothers coincide. 
That is why we restrict ourselves to the local constant approximation. 

One drawback of the estimator (3) is that it is not defined when the 
denominator is zero. Different approaches for overcoming this problem have 
been proposed. Banon [3] has suggested artificially increasing the denominator 
by a deterministic term $\varepsilon/(T\nu)$, which asymptotically vanishes but 
allows the denominator to stay positive. We adopt this approach in this paper, 
but with a more careful choice of the term to be added. Note that, 
in the context of nonparametric regression, a similar approach is developed 
in [13]. 

The second drawback of the estimator (3) is the presence of the stochastic 
integral. To explain why this integral is undesirable, let us recall that our 
final goal is to choose all the parameters and, in particular, the bandwidth 
$\alpha_T$ in a data-dependent way. If we replace $\alpha$ by an approximation depending 
on the observed path $(X_t, 0 \le t \le T)$, we obtain an anticipative stochastic 
integral. The manipulation of such integrals is technically more difficult than 
the manipulation of Riemann integrals. 

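In order to replace the stochastic integral by a Riemann integral, we apply 
the Itô formula to the primitive of the function $\alpha^{-1}K((x - \cdot)/\alpha)$ and to the 
semi-martingale $X$: 

$$\int_{X_0}^{X_T} K\Big(\frac{x - y}{\alpha}\Big)\,dy = \int_0^T K\Big(\frac{x - X_t}{\alpha}\Big)\,dX_t - \frac{1}{2\alpha}\int_0^T K'\Big(\frac{x - X_t}{\alpha}\Big)\sigma^2(X_t)\,dt.$$

We show that, in the ergodic case, the term $\int_{X_0}^{X_T} K((x - y)/\alpha)\,dy$ is asymptotically 
negligible with respect to the other terms. Therefore, the stochastic 
integral $\int_0^T K((x - X_t)/\alpha)\,dX_t$ can be approximated by 
$(2\alpha)^{-1}\int_0^T K'((x - X_t)/\alpha)\,\sigma^2(X_t)\,dt$. 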

According to these considerations, we modify the estimator (3) as 

$$\tilde S_T(x) = \frac{(1/\alpha^2)\int_0^T K'((x - X_t)/\alpha)\,\sigma^2(X_t)\,dt}{(2/\nu)\int_0^T Q((x - X_t)/\nu)\,dt + (2\varepsilon/\nu)\,e^{-\ell_T|x|}}, \tag{4}$$

where $\varepsilon = \varepsilon_T = e^{-\sqrt{\log T}}$ and $\ell_T = (\log T)^{-1}$. It is proved in [27] that if the 
unknown drift function is $k$-times continuously differentiable, then the bandwidths 
$\alpha_T = T^{-1/(2k+1)}$ and $\nu_T = T^{-1/2}$ lead to a locally and globally rate-optimal 
estimator $\hat\theta_T(x)$. The rate of convergence of this estimator is then 
$T^{-k/(2k+1)}$. This choice of the bandwidth $\alpha_T$ is clearly nonadaptive, since it 
depends on the unknown parameter $k$. 

In the case of an ergodic diffusion, one can arrive at the same estimator 
using the well-known formula 

$$(\sigma^2(x) f_S(x))' = 2S(x) f_S(x), \tag{5}$$

where $f_S$ is the invariant density of the process $X$. Using the occupation 
time formula and the martingale representation of the local time, one can 
check that $(T\alpha^2)^{-1}\int_0^T K'((x - X_t)/\alpha)\,\sigma^2(X_t)\,dt$ is a consistent estimator of 
$(\sigma^2(x) f_S(x))'$. Likewise, $2(T\nu)^{-1}\int_0^T Q((x - X_t)/\nu)\,dt$ is a consistent estimator 
of $2f_S(x)$. It is now quite natural to define the estimator of $S(x)$ as the 
quotient of these two estimators. 
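
The quotient construction can be tried on simulated data. The sketch below is an illustration under assumptions that are not in the original: an Euler-discretized Ornstein-Uhlenbeck path ($S(x) = -x$, $\sigma \equiv 1$), a Gaussian kernel $K$, a biweight kernel $Q$, Riemann sums in place of the time integrals, and the bandwidths $\alpha = T^{-1/5}$ and $\nu = T^{-1/2}$ (that is, $k = 2$). The regularizing term in the denominator follows (4).

```python
import math
import random

def gauss_kernel_deriv(u):
    """Derivative K'(u) of the standard Gaussian kernel K."""
    return -u * math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def biweight(u):
    """Kernel Q supported on [-1, 1], differentiable, integrating to one."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0

def drift_estimate(path, dt, x, alpha, nu):
    """Riemann-sum version of the ratio estimator (4) with sigma = 1."""
    T = dt * (len(path) - 1)
    eps = math.exp(-math.sqrt(math.log(T)))
    ell = 1.0 / math.log(T)
    num = sum(gauss_kernel_deriv((x - xt) / alpha) for xt in path[:-1]) * dt / alpha ** 2
    den = 2.0 * sum(biweight((x - xt) / nu) for xt in path[:-1]) * dt / nu
    den += 2.0 * eps / nu * math.exp(-ell * abs(x))  # keeps the denominator positive
    return num / den

# Simulated Ornstein-Uhlenbeck path (illustrative assumption); true drift S(x) = -x.
rng = random.Random(2)
T, dt = 1000.0, 0.01
x, path = 0.0, [0.0]
for _ in range(int(round(T / dt))):
    x += -x * dt + rng.gauss(0.0, math.sqrt(dt))
    path.append(x)

s_hat = drift_estimate(path, dt, 0.5, alpha=T ** (-1.0 / 5.0), nu=T ** (-0.5))
```

With these settings `s_hat` should be negative and of the same order as the true value $S(0.5) = -0.5$, up to smoothing bias and sampling noise.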

2.2. The minimax sharp adaptive estimator. To simplify the exposition, 
we suppose in this section that the diffusion coefficient $\sigma(\cdot)$ is identically 
equal to one. For any function $h \in L^2(\mathbb{R})$, let us denote by $\varphi_h(\cdot)$ the Fourier 
transform of $h(\cdot)$ defined as $\varphi_h(\lambda) = \int_{\mathbb{R}} e^{i\lambda x} h(x)\,dx$. To avoid double 
subscripts, we write $\varphi_f$ instead of $\varphi_{f_S}$. Recall that, for any estimator 
$\bar S_T(\cdot) = \bar S_T(\cdot, X^T)$ of the drift function $S(\cdot)$, we have defined 

$$R_T(\bar S_T, S) = \int_{\mathbb{R}} \mathbf{E}_S\big[(\bar S_T(x) - S(x))^2\big]\, f_S^2(x)\,dx.$$

Some heuristic explanations of this choice of the risk function are presented 
in Section 4.1. It is proven in [6] that, in order that the estimator (4) be 
asymptotically minimax over a properly chosen Sobolev ball $\Sigma(k, R)$ ($k$ is 
the order of smoothness and $R$ is the radius), one should choose the kernels 
and the bandwidths as 

$$\alpha_T^* = \Big(\frac{4k}{\pi R T (k+1)(2k+1)}\Big)^{1/(2k+1)}, \qquad K^*(x) = \frac{1}{\pi}\int_0^1 (1 - u^{k+\rho_T})\cos(ux)\,du; \tag{6}$$

$\nu_T = T^{-1/2}$ and $Q(x)$ is any positive, differentiable, symmetric function 
with support in $[-1, 1]$ and $\int Q(x)\,dx = 1$. In equality (6), we used the notation 
$\rho_T = 1/\log\log(1 + T)$. The estimator (4) defined by such a bandwidth 
and kernel will be denoted by $S_T^*(\cdot)$. Note here that the Fourier 
transform of the kernel $K^*$ is $\varphi_{K^*}(\lambda) = (1 - |\lambda|^{k+\rho_T})_+$. The exact asymptotic 
behavior of the maximum over $\Sigma(k, R)$ of the risk of this estimator is 
$T^{-2k/(2k+1)} P(k, R)$ (see [6], Theorem 4 and Definition 2), where $P(k, R)$ is 
Pinsker's constant [32]. Moreover, the following asymptotic relation holds: 

$$R_T(\tilde S_T, S) \le \frac{\Delta_T(\alpha, \varphi_K, \varphi_f)}{8\pi T}\,(1 + o_T(1)),$$

where $o_T(1)$ is a term tending to zero uniformly in $S$ and the functional $\Delta_T$ 
is defined by 

$$\Delta_T(\alpha, h, \varphi_f) = T\int_{\mathbb{R}} |\lambda(1 - h(\alpha\lambda))\varphi_f(\lambda)|^2\,d\lambda + 4\int_{\mathbb{R}} |h(\alpha\lambda)|^2\,d\lambda.$$

Since for known $k$ the optimal kernel is given by (6), it will be natural to 
select the adaptive kernel among the functions 
$\{K_\beta(x) = \pi^{-1}\int_0^1 (1 - u^\beta)\cos(ux)\,du \mid \beta > 0\}$ in a data-driven way. Thus, 
it suffices to give a good adaptive choice of the real parameters $\alpha$ and $\beta$ in 
order to obtain an adaptive estimator of $S$. The values of these parameters that 
are of interest for us are those minimizing the risk $R_T(\tilde S_T, S)$ or, equivalently, 
$\Delta_T(\alpha, h_\beta, \varphi_f)$, where 

$$h_\beta(\lambda) = (1 - |\lambda|^\beta)_+.$$
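
For concreteness, the family $K_\beta$, the filter $h_\beta$ and the bandwidth from (6) can be evaluated numerically. This is only a sketch: the function names and the midpoint quadrature are illustrative choices, not part of the original construction.

```python
import math

def h_beta(lam, beta):
    """Fourier-domain filter h_beta(lambda) = (1 - |lambda|^beta)_+."""
    return max(0.0, 1.0 - abs(lam) ** beta)

def kernel_beta(x, beta, n=2000):
    """K_beta(x) = (1/pi) * int_0^1 (1 - u^beta) cos(u x) du, by the midpoint rule."""
    du = 1.0 / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * du
        total += (1.0 - u ** beta) * math.cos(u * x)
    return total * du / math.pi

def alpha_star(k, R, T):
    """Bandwidth from (6): (4k / (pi R T (k+1)(2k+1)))^(1/(2k+1))."""
    return (4.0 * k / (math.pi * R * T * (k + 1) * (2 * k + 1))) ** (1.0 / (2 * k + 1))

# At x = 0 the integral has the closed form beta / (pi * (beta + 1)).
k0 = kernel_beta(0.0, 2.0)
a = alpha_star(2, 1.0, 1000.0)
```

The closed form at $x = 0$ gives a cheap check of the quadrature; `alpha_star` decreases to zero as $T$ grows, as required of a bandwidth.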

The minimizers of $\Delta_T$ obviously depend on the unknown function $S$, so they 
cannot be used in an estimation procedure. A standard method for overcoming 
this difficulty is to estimate $\Delta_T(\alpha, h, \varphi_f)$ by a data-dependent functional 
$l_T(\alpha, h)$ that does not involve the function $S$. Then the minimizers of the 
latter functional might be chosen as parameters for the adaptive procedure. 
Perhaps the most straightforward idea for estimating $\Delta_T(\alpha, h, \varphi_f)$ is to utilize 
the plug-in estimator $\Delta_T(\alpha, h, \hat\varphi_T)$, $\hat\varphi_T$ being the empirical characteristic 
function 

$$\hat\varphi_T(\lambda) = \frac{1}{T}\int_0^T e^{i\lambda X_t}\,dt.$$

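But it is well known that the plug-in estimators of quadratic functionals have 
a large bias (cf., e.g., [12]). That is why a smarter solution consists in applying 
the plug-in method to $\Delta_T$ considered as a linear functional of $|\varphi_f(\cdot)|^2$. 
According to Lemma 1, a good estimate of $|\varphi_f(\lambda)|^2$ is $|\hat\varphi_T(\lambda)|^2 - 4/(T\lambda^2)$. 
On the other hand, the minimization of $\Delta_T(\alpha, h_\beta, \varphi_f)$ w.r.t. the parameters $\alpha$ 
and $\beta$ is obviously equivalent to the minimization of 

$$\tilde\Delta_T(\alpha, h_\beta, \varphi_f) = T\int_{\mathbb{R}} \lambda^2\big(h_\beta^2(\alpha\lambda) - 2h_\beta(\alpha\lambda)\big)|\varphi_f(\lambda)|^2\,d\lambda + 4\int_{\mathbb{R}} |h_\beta(\alpha\lambda)|^2\,d\lambda,$$

since it is just $\Delta_T(\alpha, h_\beta, \varphi_f) - T\int_{\mathbb{R}} \lambda^2|\varphi_f(\lambda)|^2\,d\lambda$. For this reason, we define 
the functional 

$$\begin{aligned}
l_T(h) &= T\int_{\mathbb{R}} \lambda^2\big(h^2(\lambda) - 2h(\lambda)\big)|\hat\varphi_T(\lambda)|^2\,d\lambda - 4\int_{\mathbb{R}} \big(h^2(\lambda) - 2h(\lambda)\big)\,d\lambda + 4\int_{\mathbb{R}} |h(\lambda)|^2\,d\lambda \\
&= T\int_{\mathbb{R}} \lambda^2\big(h^2(\lambda) - 2h(\lambda)\big)|\hat\varphi_T(\lambda)|^2\,d\lambda + 8\int_{\mathbb{R}} h(\lambda)\,d\lambda.
\end{aligned} \tag{7}$$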
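
The functional (7) is straightforward to evaluate on a discretized path. The sketch below is illustrative only: it assumes $\sigma \equiv 1$, replaces the time integral by a Riemann sum and the frequency integrals by a midpoint rule on the support of $h$, and takes $h(\lambda) = (1 - |\alpha\lambda|^\beta)_+$; the function names are hypothetical.

```python
import cmath

def empirical_cf(path, dt, lam):
    """Riemann-sum approximation of (1/T) * int_0^T exp(i * lam * X_t) dt."""
    T = dt * (len(path) - 1)
    return sum(cmath.exp(1j * lam * x) for x in path[:-1]) * dt / T

def l_T(path, dt, alpha, beta, n_grid=400):
    """Second form of (7) for h(lam) = (1 - |alpha * lam|^beta)_+ (midpoint rule)."""
    T = dt * (len(path) - 1)
    lam_max = 1.0 / alpha              # h vanishes outside [-1/alpha, 1/alpha]
    dlam = lam_max / n_grid
    total = 0.0
    for i in range(n_grid):
        lam = (i + 0.5) * dlam
        h = 1.0 - (alpha * lam) ** beta
        cf2 = abs(empirical_cf(path, dt, lam)) ** 2
        total += (T * lam * lam * (h * h - 2.0 * h) * cf2 + 8.0 * h) * dlam
    return 2.0 * total                 # the integrand is even in lam

# Sanity check on a degenerate path X_t = 0 (empirical c.f. equal to 1), with T = 1.
value = l_T([0.0] * 101, 0.01, alpha=1.0, beta=1.0)
```

For the degenerate path the integrals can be computed by hand: with $T = 1$, $\alpha = \beta = 1$ the integrand on $[0, 1]$ reduces to $\lambda^4 - \lambda^2 + 8 - 8\lambda$, so `value` should be close to $116/15$.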

This functional depends on the observed path via the empirical characteristic 
function $\hat\varphi_T$. To obtain the adaptive kernel $K_\beta$ and the adaptive bandwidth 
$\alpha$, one should minimize the expression $l_T(h)$ over a suitably chosen 
subset $\mathcal{H}_N$ of the set 

$$\mathcal{H}_T = \big\{h : x \mapsto h_\beta(\alpha x) = (1 - |\alpha x|^\beta)_+ \ \big|\ \alpha \in [T^{-1/3}, (\log T)^{-1}],\ \beta > 1\big\},$$

such that $\operatorname{card}(\mathcal{H}_N) = N$. The subset $\mathcal{H}_N$ is defined as in [4]. For any pair of 
positive integers $i$ and $j$, let us denote 

(8) ai =(l + -^—\ and p j= (l 



logTJ ' J V logT, 

The finite subset $\mathcal{H}_N$ of $\mathcal{H}_T$ is defined as 

$$\mathcal{H}_N = \big\{h : x \mapsto (1 - |\alpha_i x|^{\beta_j})_+ \ \big|\ \alpha_i \in [T^{-1/3}, (\log T)^{-1}],\ j = 1, \ldots, \lfloor \log T \rfloor\big\},$$

where $\lfloor a \rfloor$ denotes the largest integer strictly smaller than the real number $a$. 
It is evident that the cardinality of $\mathcal{H}_N$ is less than $(\log T)^3$. From now on, we 
denote the $N$ elements of this set by $h_1, h_2, \ldots, h_N$. Thus, to construct the 
adaptive estimator, the functional $l_T$ is minimized over a set of cardinality 
not exceeding $(\log T)^3$. Let us now summarize the method. 



2.3. Brief description of the procedure. We start by computing the values 
$\alpha_i$ and $\beta_j$ according to (8). Then we determine the function $\hat h_T \in \mathcal{H}_N$ such 
that $l_T(\hat h_T) = \min_{h \in \mathcal{H}_N} l_T(h)$. If the function satisfying the latter equality is 
not unique, we denote by $\hat h_T$ one of them. Next we apply the inverse Fourier 
transform to $\hat h_T$ in order to define the kernel 

$$\hat K_T(x) = \frac{1}{2\pi}\int_{\mathbb{R}} \hat h_T(\lambda)\cos(\lambda x)\,d\lambda.$$

This form of the kernel comprises the bandwidth since $\hat h_T(\lambda) = h_{\hat\beta_T}(\hat\alpha_T\lambda)$, 
where $\hat\alpha_T$ and $\hat\beta_T$ are the values of $\alpha_i$ and $\beta_j$ corresponding to $\hat h_T$. Further, 
we choose another kernel function $Q(\cdot)$ which is positive, differentiable, symmetric, 
supported in $[-1, 1]$ and with integral equal to one. Finally, we set 
$\varepsilon_T = e^{-\sqrt{\log T}}$, $\ell_T = 1/\log T$ and define the estimator 

$$\hat S_T(x) = \frac{\int_0^T \hat K_T'(x - X_t)\,dt}{2\sqrt{T}\int_0^T Q(\sqrt{T}(x - X_t))\,dt + 2\sqrt{T}\,\varepsilon_T\, e^{-\ell_T|x|}}.$$

Note that the function $\hat K_T(\cdot)$ is differentiable, since $\min_j \beta_j > 1$. 
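
The steps of Section 2.3 can be sketched end to end on simulated data. Everything below is an illustration under stated assumptions rather than the paper's implementation: the path is an Euler-discretized Ornstein-Uhlenbeck process ($S(x) = -x$, $\sigma \equiv 1$), $Q$ is a biweight kernel, integrals are replaced by Riemann and midpoint sums, and the candidate values in `alphas` and `betas` form a small hypothetical grid that is not the grid (8). The numerator is evaluated in the Fourier domain, using $T^{-1}\int_0^T \hat K_T'(x - X_t)\,dt = \pi^{-1}\int_0^\infty \hat h_T(\lambda)\,\mathrm{Re}[-i\lambda e^{-i\lambda x}\hat\varphi_T(\lambda)]\,d\lambda$, which follows from the definition of $\hat K_T$.

```python
import cmath
import math
import random

def biweight(u):
    """Kernel Q: positive, differentiable, symmetric, supported on [-1, 1]."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0

def h(alpha, beta, lam):
    """Filter h_beta(alpha * lam) = (1 - |alpha * lam|^beta)_+."""
    return max(0.0, 1.0 - (alpha * abs(lam)) ** beta)

def adaptive_drift(path, dt, xs, alphas, betas, n_grid=100):
    """Select (alpha, beta) by minimizing l_T, then evaluate the estimator at xs."""
    n = len(path) - 1
    T = dt * n
    dlam = 1.0 / (min(alphas) * n_grid)
    lams = [(i + 0.5) * dlam for i in range(n_grid)]
    # Empirical characteristic function on the frequency grid (Riemann sum in time).
    cf = [sum(cmath.exp(1j * lam * x) for x in path[:-1]) / n for lam in lams]

    def l_T(alpha, beta):
        s = 0.0
        for lam, c in zip(lams, cf):
            hv = h(alpha, beta, lam)
            s += (T * lam * lam * (hv * hv - 2.0 * hv) * abs(c) ** 2 + 8.0 * hv) * dlam
        return 2.0 * s  # even integrand: integrate over lam > 0 and double

    alpha_hat, beta_hat = min(((a, b) for a in alphas for b in betas),
                              key=lambda ab: l_T(*ab))
    eps = math.exp(-math.sqrt(math.log(T)))
    ell = 1.0 / math.log(T)
    root_T = math.sqrt(T)
    estimates = []
    for x in xs:
        # Derivative estimate of the invariant density, computed in the Fourier domain.
        f1 = sum(h(alpha_hat, beta_hat, lam)
                 * (-1j * lam * cmath.exp(-1j * lam * x) * c).real
                 for lam, c in zip(lams, cf)) * dlam / math.pi
        # Density estimate with kernel Q and bandwidth T^(-1/2).
        f0 = sum(biweight(root_T * (x - xt)) for xt in path[:-1]) * dt / root_T
        estimates.append(f1 / (2.0 * f0 + 2.0 * eps / root_T * math.exp(-ell * abs(x))))
    return alpha_hat, beta_hat, estimates

# Simulated Ornstein-Uhlenbeck path (illustrative assumption); true drift S(x) = -x.
rng = random.Random(3)
T, dt = 300.0, 0.05
x, path = 0.0, [0.0]
for _ in range(int(round(T / dt))):
    x += -x * dt + rng.gauss(0.0, math.sqrt(dt))
    path.append(x)

alphas = [0.2, 0.3, 0.45, 0.7, 1.0]   # hypothetical candidates, not the grid (8)
betas = [1.5, 2.0, 3.0, 4.0]
alpha_hat, beta_hat, (s_hat,) = adaptive_drift(path, dt, [0.7], alphas, betas)
```

Computing the empirical characteristic function once and reusing it for both the selection step and the numerator keeps the cost at one pass over the path per frequency; `s_hat` should then be negative and of the order of $S(0.7) = -0.7$, up to sampling noise.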

3. Assumptions and main results. We introduce four conditions playing 
an important role throughout this paper. They ensure the existence of the 
observed diffusion process as a solution of (1) and provide us with some 
technical tools permitting us to deal with this process. Before stating these 
conditions, we need some additional notation. 

Recall that the solution of the stochastic differential equation (1) is a 
strong Markov process. We denote by $P_t(S, x, A)$ the transition probability 
corresponding to the instant $t$, that is, 

$$P_t(S, x, A) = \mathbf{P}_S(X_t \in A \mid X_0 = x) \qquad \forall x \in \mathbb{R},\ \forall A \in \mathcal{B}(\mathbb{R}).$$

Here $\mathbf{P}_S$ denotes the probability measure on $(C(\mathbb{R}), \mathcal{B}_{C(\mathbb{R})})$ induced by the 
process (1). For every $x \in \mathbb{R}$ and $t > 0$, the probability measure $P_t(S, x, \cdot)$ 
is absolutely continuous with respect to the Lebesgue measure. The corresponding 
density will be denoted by $p_t(S, x, y)$, so that, for any integrable 
function $g(\cdot)$, we have 

$$\mathbf{E}_S[g(X_t) \mid \mathcal{F}_s] = \int_{\mathbb{R}} g(y)\,p_{t-s}(S, X_s, y)\,dy. \tag{9}$$

Let $k$ be a strictly positive integer. Denote by $\Sigma(k)$ the set of all functions $S$ 
satisfying the following conditions: 

C1. The function $S$ is $k$-times continuously differentiable on the whole real 
line and $\limsup_{|x| \to \infty} S(x)\,\mathrm{sgn}(x) < 0$. 

C2. There exist positive numbers $C$ and $\nu$ such that $|S^{(k-1)}(x)| \le C(1 + |x|^\nu)$, 
$\forall x \in \mathbb{R}$. 

The problem we consider is the following: we know that $x^T$ is a sample 
path of the process $X^T$ given by (1) with a drift function $S \in \Sigma = \bigcup_{k \ge 1} \Sigma(k)$ 
and we want to estimate the function $S(\cdot)$. To obtain minimax results, we 
consider the local setting. For any function $S_0 \in \Sigma(k)$ and for all $\delta > 0$, we 
define the neighborhoods $V_\delta(S_0) = \{S \in \Sigma \mid \sup_{x \in \mathbb{R}} |S(x) - S_0(x)| \le \delta\}$ and 

$$\tilde V_\delta(S_0) = \Big\{S \in \Sigma(k) \ \Big|\ \sup_{x \in \mathbb{R}} |S^{(i)}(x) - S_0^{(i)}(x)| \le \delta,\ i = 0, 1, \ldots, k-1\Big\}.$$

The center of localization $S_0(\cdot)$ is assumed to fulfill the following additional 
assumptions: 

C3. There exist a positive number $\kappa$ and a $q > 1$ such that the quantity 
$\sup_{t > \kappa} \mathbf{E}_{S_0}\big[\sup_{y \in \mathbb{R}} p_\kappa^q(S_0, X_{t-\kappa}, y)\big]$ is finite. 

C4. Let $\varphi_0(\cdot)$ be the Fourier transform of the invariant density $f_{S_0}(\cdot)$. There 
exists $\tau > 0$ such that $\int_{\mathbb{R}} |\lambda|^{2k+2+\tau} |\varphi_0(\lambda)|^2\,d\lambda < \infty$. 

Conditions C1-C4 perhaps need some comments. The first one ensures the 
ergodicity (see [18]) of the solution of the stochastic differential equation (1). 
This condition also entails the exponential smallness of the tails of $f_S(\cdot)$. The 
second condition guarantees the square integrability of the functions $f_S^{(i)}(\cdot)$, 
for every $i = 0, 1, \ldots, k$. 

Condition C3 is a technical one and can be considered a mixing property 
of the underlying diffusion process. It can be viewed as a weakened version 
of the condition $G_2(s, \alpha)$ from [3]. Some sufficient conditions for C3 are given 
in Section 4.4. 

Finally, C4 means that the central function $S_0(\cdot)$ is a little bit smoother 
than the other functions of the neighborhood. For example, if $S_0(\cdot)$ is $(k+1)$-times 
differentiable and $S_0^{(k+1)}(\cdot)$ increases at most polynomially, then C4 is 
satisfied with $\tau = 2$. 

We now define the Sobolev balls; in our setup they are also weighted by 
the square of the invariant density. Let us denote 

$$\Sigma_\delta(k, R, S_0) = \Big\{S \in \tilde V_\delta(S_0) \ \Big|\ \int_{\mathbb{R}} \big[(S - S_0)^{(k)}(x)\big]^2 f_S^2(x)\,dx \le R\Big\}.$$

To simplify the notation, we write $\Sigma_\delta$ instead of $\Sigma_\delta(k, R, S_0)$. 

We now state the main theorem of this work describing the asymptotic 
behavior of the estimator $\hat S_T$ constructed in the previous section. 

Theorem 1. Let $S_0$ satisfy assumptions C1-C4 and let the risk $R_T(\cdot, \cdot)$ 
be defined by (2). If the initial condition $\xi$ follows the invariant law, then 

$$\limsup_{\delta \to 0}\,\limsup_{T \to \infty}\,\sup_{S \in \Sigma_\delta} T^{2k/(2k+1)} R_T(\hat S_T, S) = P(k, R),$$

where $P(k, R) = (2k+1)\big(\frac{k}{\pi(k+1)(2k+1)}\big)^{2k/(2k+1)} R^{1/(2k+1)}$ is Pinsker's constant 
(cf. [32]). 
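
To get a feeling for the magnitudes involved, the constant $P(k, R)$ and the leading term $T^{-2k/(2k+1)}P(k, R)$ of the minimax risk can be evaluated directly. The helper names below are hypothetical and the snippet is only a numerical illustration of the formula in Theorem 1.

```python
import math

def pinsker_constant(k, R):
    """P(k, R) = (2k+1) * (k / (pi (k+1)(2k+1)))^(2k/(2k+1)) * R^(1/(2k+1))."""
    base = k / (math.pi * (k + 1) * (2 * k + 1))
    return (2 * k + 1) * base ** (2.0 * k / (2 * k + 1)) * R ** (1.0 / (2 * k + 1))

def minimax_risk_leading_term(k, R, T):
    """Leading term T^(-2k/(2k+1)) * P(k, R) of the minimax risk."""
    return T ** (-2.0 * k / (2 * k + 1)) * pinsker_constant(k, R)

p11 = pinsker_constant(1, 1.0)  # about 0.4236
# Multiplying T by 10 divides the leading term by 10^(4/5) when k = 2.
ratio = minimax_risk_leading_term(2, 1.0, 1e4) / minimax_risk_leading_term(2, 1.0, 1e5)
```

The constant scales as $R^{1/(2k+1)}$ in the radius, so for $k = 2$ multiplying $R$ by 32 doubles it.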

Note that this asymptotic bound cannot be improved, since it coincides 
with the lower bound obtained in [6], Theorem 3. Hence, the adaptive estimator 
$\hat S_T$ behaves asymptotically as well as the best possible nonadaptive 
estimator, provided that the specific form (2) of the risk function is used. 

Consequence. As an immediate consequence of the above theorem, one 
obtains the rate-optimality of the estimator $\hat S_T$ when the error of estimation 
is quantified using the MISE over a compact set $\mathcal{K} \subset \mathbb{R}$. That is, for 
sufficiently small values of $\delta$, we have 

$$\limsup_{T \to \infty}\,\sup_{S \in \Sigma_\delta} T^{2k/(2k+1)}\,\mathbf{E}_S \int_{\mathcal{K}} (\hat S_T(x) - S(x))^2\,dx \le C < \infty.$$

The reason for this is the boundedness, uniform in $S \in \Sigma_\delta$, of the functions 
$f_S$ and $f_S^{-1}$ on any compact set $\mathcal{K}$. 

4. Remarks and extensions. 

4.1. The weight function. The choice of the weight function in the risk 
definition is mainly motivated by the weak equivalence of experiments. In 
fact, as is explained in detail in [6], the Gaussian white noise experiment 
having almost the same statistical properties as our model is 

$$dY_t = S(t)\,dt + [T f_0(t)]^{-1/2}\,dB_t, \qquad t \in \mathbb{R}, \tag{10}$$

where $B_t$ is a two-sided standard Brownian motion and $f_0 = f_{S_0}$. On the 
other hand, according to Golubev [20], the asymptotically optimal lower 
bound of the maximum of MISE (over the Sobolev ball of smoothness $k$ and 
radius $R$) in the model 

$$dZ_t = \theta(t)\,dt + \varepsilon I^{-1/2}(\theta_0, t)\,dB_t, \qquad t \in \mathbb{R},$$

is equal to $\varepsilon^{4k/(2k+1)} P(k, R)\big[\int_{\mathbb{R}} I^{-1}(\theta_0, t)\,dt\big]^{2k/(2k+1)}$. Our aim is to find a 
normalization of the MISE via a weight function such that the resulting limit 
of the minimax risk does not depend on the central function $\theta_0$. This would 
hold if the integral of the inverse Fisher information $I^{-1}(\theta_0, \cdot)$ were independent 
of $\theta_0$. Obviously, this is not the case for (10). In order to obtain a model 
enjoying the desired property, we transform (10) by multiplying it by $f_0$. We 
get 

$$dV_t = S(t) f_0(t)\,dt + \sqrt{T^{-1} f_0(t)}\,dB_t, \qquad t \in \mathbb{R}. \tag{11}$$

The integral of the inverse of the Fisher information associated with the 
last model is one, since $f_0$ is a probability density. On the other hand, since 
the function $f_S$ is a regular functional of $S$, it can be estimated with more 
precision than the function $S$. At a heuristic level, this is the reason why 
estimating the function $S f_S$ in $L^2$ is equivalent to estimating $S$ in $L^2$ with 
the weight function $f_S^2$. 

Note also that the use of a weight function for estimating $S$ over the 
whole real line is unavoidable; otherwise the risk of estimation will explode. 
Moreover, any deterministic weight function has to depend on the unknown 
function $S$ (or, at least, on an upper estimate of $S$). Indeed, if we observe a 
path $X^T$, it contains no information about the values of $S$ that are outside of 
the interval $[x_*, x^*]$, where $x_* = \min_{t \in [0,T]} X_t$ and $x^* = \max_{t \in [0,T]} X_t$. Thus, 
the error of estimating $S$ at a point $x \notin [x_*, x^*]$ is large when $S(x)$ is large. 
Consequently, in order that the integral $\int_{\mathbb{R}} (\bar S_T - S)^2 q_S$ be finite, the weight 
function $q_S$ should be small when $S$ is large. 

4.2. The case of a general diffusion coefficient. Let us consider the case 
where $\sigma$ is an arbitrary positive function such that $\sigma^2 + \sigma^{-2}$ is bounded by 
some polynomial function. It is also assumed that $\sigma$ is $(k+1)$-times differentiable 
and the condition C1 is replaced by $\limsup_{|x| \to \infty} S(x)\,\mathrm{sgn}(x)/\sigma^2(x) < 0$. 

In this case, the functional $\Delta_T = \Delta_T(\alpha, h, \varphi_{\sigma^2 f}, \|\sigma\|_{2,f}^2)$ has the form 

$$T\int_{\mathbb{R}} |\lambda(1 - h(\alpha\lambda))\varphi_{\sigma^2 f}(\lambda)|^2\,d\lambda + 4\int_{\mathbb{R}} \sigma^2(x) f_S(x)\,dx \int_{\mathbb{R}} |h(\alpha\lambda)|^2\,d\lambda.$$

It follows from this expression that, in order to construct an estimator $l_T$ 
of $\Delta_T$, one has to estimate not only the square of the Fourier transform 
$|\varphi_{\sigma^2 f}(\lambda)|$, but also the term $\|\sigma\|_{2,f}^2 = \int_{\mathbb{R}} \sigma^2(x) f_S(x)\,dx$. Fortunately, this 
latter quantity is just a linear functional of $f_S$ and therefore can be estimated 
at the parametric rate $T^{-1/2}$. Let us denote $\hat\sigma_T^2 = T^{-1}\int_0^T \sigma^2(X_t)\,dt$; it is an 
efficient estimator of $\|\sigma\|_{2,f}^2$. An almost unbiased estimate of $|\varphi_{\sigma^2 f}(\lambda)|^2$ is 
then $|\hat\varphi_T(\lambda)|^2 - 4\hat\sigma_T^2/(T\lambda^2)$. The empirical characteristic function in this case 
has the form $\hat\varphi_T(\lambda) = T^{-1}\int_0^T e^{i\lambda X_t}\sigma^2(X_t)\,dt$. Accordingly, the functional $l_T$ 
is defined in this case as 

$$l_T(h) = T\int_{\mathbb{R}} \lambda^2\big(h^2(\lambda) - 2h(\lambda)\big)|\hat\varphi_T(\lambda)|^2\,d\lambda + 8\hat\sigma_T^2\int_{\mathbb{R}} h(\lambda)\,d\lambda.$$

The remaining steps of the construction of the adaptive procedure do not 
need any modification. That is, we define $\hat h_T$ by $l_T(\hat h_T) = \min_{h \in \mathcal{H}_N} l_T(h)$. 
Next we apply to $\hat h_T$ the inverse Fourier transform in order to define the 
kernel 

$$\hat K_T(x) = \frac{1}{2\pi}\int_{\mathbb{R}} \hat h_T(\lambda)\cos(\lambda x)\,d\lambda.$$

Further, we choose another kernel function $Q(\cdot)$, which is positive, differentiable, 
symmetric, supported in $[-1, 1]$ and with $\int Q(u)\,du = 1$. Finally, we 
set $\varepsilon_T = e^{-\sqrt{\log T}}$, $\ell_T = 1/\log T$ and define the estimator 

$$\hat S_T(x) = \frac{\int_0^T \sigma^2(X_t)\,\hat K_T'(x - X_t)\,dt}{2\sqrt{T}\int_0^T Q(\sqrt{T}(x - X_t))\,dt + 2\sqrt{T}\,\varepsilon_T\, e^{-\ell_T|x|}}.$$

The only thing that changes in Theorem 1 is the limiting constant. In this 
case the choice of a specific weight function does not allow one to obtain a 
limiting bound independent of $S_0$. The constant that we obtain is 

$$P(S_0, \sigma, k, R) = P(k, R)\Big(\int_{\mathbb{R}} \sigma^2(x) f_0(x)\,dx\Big)^{2k/(2k+1)}.$$

4.3. What happens if the diffusion is not ergodic? Note first that a diffusion, 
like any Markov process, can be positively recurrent, null recurrent or 
transient. Our method of adaptation, as well as the other methods suggested 
in the statistical literature for estimating adaptively the drift function, relies 
heavily on the fact that the variance of the stochastic component in the risk 
decomposition is of order $1/(T\alpha_T)$, where $\alpha_T$ is the bandwidth or a smoothing 
parameter. This condition, as can be derived from the asymptotic equivalence 
result proven in [7], is not satisfied in the case of a null recurrent 
diffusion. The variance in that case is of order $1/(\sqrt{T}\alpha_T)$ and, consequently, 
the rates of convergence of drift estimators are significantly worse. 

As to transient diffusions, even the simple feature of consistency fails for 
any estimator, since the amount of information concerning the value of S 
at a point x contained in the observed path X T does not increase when T 
increases to infinity. That is the main reason for separating the ergodic case 
from the others. 

4.4. Sufficient conditions for C3. A wide class of drift functions $S$ satisfying 
C3 is the set of all bounded functions: it is proven in [16] that there exist 
two positive constants $c_1$ and $c_2$ such that $p_t(S, x, y) \le c_1 t^{-1/2} e^{-c_2|x - y|}$. 

In the case when $X_t$ follows the invariant law, condition C3 is satisfied 
(with any $q < 2$ and any $\kappa > 0$) if $S$ is differentiable, satisfies condition 
C1 and $S(x)^2 + S'(x) \ge c$ for some constant $c \in \mathbb{R}$. Indeed, formula (7) on 
page 95 in [18] implies 



where $\phi(\cdot)$ stands for the density of the standard normal law. Therefore, 



since, under condition C1, the function $f_S$ decreases exponentially fast. 

5. Proofs. 

5.1. An auxiliary result. We start with a proposition reducing the study 
of the performance of drift estimators to that of the invariant density and 
its derivative estimators. From now on we suppose for simplicity that $\sigma \equiv 1$. 
Suppose now that $\hat f_T(\cdot)$ and $\hat f_T^{(1)}(\cdot)$ are estimators of the invariant density 
$f_S(\cdot)$ and its derivative $f_S'(\cdot)$, satisfying the conditions 


E1. There exist $C_1, \gamma > 0$ such that $\mathbf{E}_S\big[(\hat f_T(x) - f_S(x))^{2p}\big] \le C_1^p T^{-p} e^{-\gamma|x|}$ for 
any $x \in \mathbb{R}$, $p \ge 1$ and $S \in \Sigma_\delta$. 

E2. There exist two positive numbers $C_2$ and $q$ such that $\mathbf{E}_S\big[\hat f_T^{(1)}(x)^4\big] \le C_2 T^q$, 
for any $x \in \mathbb{R}$ and $S \in \Sigma_\delta$. 

E3. The estimator $\hat f_T^{(1)}(\cdot)$ is asymptotically efficient, that is, 

$$\lim_{\delta \to 0}\,\lim_{T \to \infty}\,\sup_{S \in \Sigma_\delta} T^{2k/(2k+1)}\,\mathbf{E}_S \int_{\mathbb{R}} \big(\hat f_T^{(1)}(x) - f_S'(x)\big)^2\,dx = 4P(k, R).$$

Following some heuristics related to the identity $S(x) = f_S'(x)/(2f_S(x))$ and 
presented in Section 2.1, we define the estimator of $S(x)$ as 

$$\hat S_T(x) = \frac{\hat f_T^{(1)}(x)}{2\hat f_T(x) + 2T^{-1/2}\varepsilon_T\, e^{-\ell_T|x|}}, \tag{12}$$

where $\varepsilon_T = T^{-1/\sqrt{\log T}} = e^{-\sqrt{\log T}}$ and $\ell_T = (\log T)^{-1}$. 

Proposition 1. If conditions E1-E3 are fulfilled and $S_0 \in \Sigma(k)$ satisfies 
C4, then we have 

$$\lim_{\delta \to 0}\,\lim_{T \to \infty}\,\sup_{S \in \Sigma_\delta} T^{2k/(2k+1)} R_T(\hat S_T, S) = P(k, R);$$

that is, the estimator $\hat S_T$ is asymptotically minimax. 

Proof. The proof of this result relies on the Markov inequality and the 
exponential inequalities proven in Lemma 4 of [6]. It is quite similar to the 
proofs of Theorems 4 and 5 of [6] and therefore will be omitted here. For 
more details, we refer the reader to Theorem 6 of [5]. □ 

5.2. Proof of Theorem 1. In the sequel the letters C and D stand for 
generic constants; the notation is used for the L 2 (M, (ix)-norm of a func- 
tion h. We assume that the initial value £ follows the invariant law; thus, 
the process X is stationary in the strict sense. 

Note that the estimator $\hat S_T$ defined in Section 2.3 is of the form (12) with
$\hat f_T^{(1)}(x) = T^{-1}\int_0^T K_T'(x - X_t)\,dt$ and $\tilde f_T(x) = T^{-1/2}\int_0^T Q(\sqrt{T}(x - X_t))\,dt$.
Therefore, it suffices to verify that conditions E1-E3 hold. It is easy to
see that Lemmas 4 and 5 and the arguments of Section 4.3 in [6] yield E1.

Condition E2 states that the fourth moment of $\hat f_T^{(1)}(x)$ is bounded in $x$
and increases in $T$ at most as a polynomial. This condition is evidently
fulfilled, since

$$K_T'(x) = -(2\pi)^{-1}\int_{\mathbb{R}} \lambda h_T(\lambda)\sin(\lambda x)\,d\lambda$$

and the integrand above is bounded in absolute value by $|\lambda|\,\mathbf{1}_{[-T^{1/3},T^{1/3}]}(\lambda)$
(recall that $h_T$ is supported by the interval $[-\alpha_T^{-1}, \alpha_T^{-1}]$, which is a subset
of $[-T^{1/3}, T^{1/3}]$). Therefore, $\mathbf{E}_S[\hat f_T^{(1)}(x)^4]$ is bounded by $CT^{8/3}$.
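It remains to verify E3, which is the most important part of the proof.
For any estimator $\hat f_T^{(1)}(\cdot)$ of $f_S'(\cdot)$, we define the risk $r_T(\hat f_T^{(1)}, f_S')$ as the mean
integrated squared error, that is, $r_T(\hat f_T^{(1)}, f_S') = \mathbf{E}_S\|\hat f_T^{(1)} - f_S'\|^2$. Due to the
Plancherel identity, we have

$$r_T(\hat f_T^{(1)}, f_S') = \frac{1}{2\pi}\mathbf{E}_S\int_{\mathbb{R}} \big|\varphi_{\hat f_T^{(1)}}(\lambda) - \varphi_{f'}(\lambda)\big|^2\,d\lambda = \frac{1}{2\pi}\mathbf{E}_S\int_{\mathbb{R}} \big|\varphi_{K_T'}(\lambda)\hat\varphi_T(\lambda) - \varphi_{f'}(\lambda)\big|^2\,d\lambda,$$

where we have used the notation $\hat\varphi_T(\lambda) = T^{-1}\int_0^T e^{i\lambda X_t}\,dt$ and the fact that
the Fourier transform of the convolution of two functions is the product
of the Fourier transforms of the functions. Now, due to the formula of the
Fourier transform of a derivative, we have

$$r_T(\hat f_T^{(1)}, f_S') = \frac{1}{2\pi}\mathbf{E}_S\int_{\mathbb{R}} |\lambda|^2\,\big|[(\hat\varphi_T - \varphi_f)(\lambda)]h_T(\lambda) - \varphi_f(\lambda)(1 - h_T(\lambda))\big|^2\,d\lambda.$$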

This latter form of the risk is convenient since the term $\hat\varphi_T(\lambda) - \varphi_f(\lambda)$ is
unbiased. Unfortunately, the randomness of the function $h_T$ does not allow
us to apply the standard bias-variance decomposition to the risk. To
bound this risk, a more careful treatment of the main part of the stochastic
component is required.
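The two Fourier facts used above (the Plancherel identity and the rule that differentiation multiplies the Fourier transform by a factor of modulus $|\lambda|$) can be checked numerically on a closed-form example. The sketch below is only an illustration under assumed inputs: it takes the $N(0, 1/2)$ density, whose Fourier transform is $e^{-\lambda^2/4}$, and compares $\|f'\|^2$ with $(2\pi)^{-1}\int \lambda^2|\varphi_f(\lambda)|^2\,d\lambda$. The helper names and integration ranges are hypothetical choices.

```python
import math


def midpoint_integral(func, lower, upper, cells):
    """Midpoint Riemann sum of func over [lower, upper]."""
    width = (upper - lower) / cells
    return width * sum(func(lower + (j + 0.5) * width) for j in range(cells))


def density(x):
    """N(0, 1/2) density (assumed example, not an object from the paper)."""
    return math.exp(-x * x) / math.sqrt(math.pi)


def density_derivative(x):
    return -2.0 * x * density(x)


def fourier_transform(lam):
    """Fourier transform of the N(0, 1/2) density; real because the density is even."""
    return math.exp(-lam * lam / 4.0)


# Squared L2 norm of f' computed directly in the space domain.
space_side = midpoint_integral(lambda x: density_derivative(x) ** 2, -10.0, 10.0, 20000)
# The same quantity computed in the frequency domain via Plancherel.
frequency_side = midpoint_integral(
    lambda lam: (lam * fourier_transform(lam)) ** 2, -40.0, 40.0, 20000
) / (2.0 * math.pi)

print(space_side, frequency_side, 1.0 / math.sqrt(2.0 * math.pi))
```

Both sides agree with the closed-form value $1/\sqrt{2\pi}$ for this particular density.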

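Lemma 1. For any $\lambda \in \mathbb{R}$, we have

$$\lambda(\hat\varphi_T(\lambda) - \varphi_f(\lambda)) = 2iT^{-1/2}\zeta_T(\lambda) + T^{-1/2} m_S(\lambda, X^T),$$

where $\zeta_T(\lambda) = T^{-1/2}\int_0^T e^{i\lambda X_t}\,dW_t$ and $m_S(\lambda, X^T)$ is a measurable function
taking complex values such that, for sufficiently small values of $\delta > 0$,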
suPseE Jo Es|m 5 (A, X T )\ 2 dX < C. 

From now on, for two functions of $T$, say, $a_T$ and $b_T$, we write $a_T \sim b_T$ if
the function $a_T/b_T$ tends to one as $T \to \infty$ uniformly in all the parameters
entering in the definitions of these functions (in particular, uniformly in
$S \in \Sigma_\delta$, for sufficiently small values of $\delta$). Using Lemma 1 and the fact that
$\alpha_i \ge T^{-1/3}$, one can show that

$$r_T(\hat f_T^{(1)}, f_S') \sim \frac{1}{2\pi}\mathbf{E}_S\int_{\mathbb{R}} \big|2iT^{-1/2}\zeta_T(\lambda)h_T(\lambda) + \lambda\varphi_f(\lambda)(1 - h_T(\lambda))\big|^2\,d\lambda.$$

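The last expression is obviously of the same order as

$$\frac{4}{2\pi T}\mathbf{E}_S\int_{\mathbb{R}} |\zeta_T(\lambda)h_T(\lambda)|^2\,d\lambda + \frac{1}{2\pi}\mathbf{E}_S\int_{\mathbb{R}} (1 - h_T(\lambda))^2|\lambda\varphi_f(\lambda)|^2\,d\lambda - \frac{4}{2\pi\sqrt{T}}\,\mathrm{Im}\,\mathbf{E}_S\int_{\mathbb{R}} \lambda h_T(\lambda)(1 - h_T(\lambda))\zeta_T(\lambda)\varphi_f(-\lambda)\,d\lambda,$$

where $\mathrm{Im}\,z$ is the imaginary part of the complex number $z$. This relation
can be rewritten as

(13) $r_T(\hat f_T^{(1)}, f_S') \sim \frac{1}{2\pi T}\big(\mathbf{E}_S[\Delta_T(1, h_T, \varphi_f)] + A_1 - \mathrm{Im}\,A_2\big),$

where $A_1 = 4\mathbf{E}_S\int_{\mathbb{R}} h_T^2(\lambda)(|\zeta_T(\lambda)|^2 - 1)\,d\lambda$ and

$$A_2 = 4\sqrt{T}\,\mathbf{E}_S\int_{\mathbb{R}} \lambda h_T(\lambda)(1 - h_T(\lambda))\zeta_T(\lambda)\varphi_f(-\lambda)\,d\lambda.$$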

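Note also that, from the definition of the functional $l_T$, one gets

$$l_T(h) = \int_{\mathbb{R}} 8h(\lambda)\,d\lambda + T\int_{\mathbb{R}} (h^2(\lambda) - 2h(\lambda))|\lambda\varphi_f(\lambda)|^2\,d\lambda$$
$$\qquad - 2T\,\mathrm{Re}\int_{\mathbb{R}} (h^2(\lambda) - 2h(\lambda))\lambda^2\varphi_f(-\lambda)(\hat\varphi_T(\lambda) - \varphi_f(\lambda))\,d\lambda$$
$$\qquad + T\int_{\mathbb{R}} \lambda^2(h^2(\lambda) - 2h(\lambda))|\hat\varphi_T(\lambda) - \varphi_f(\lambda)|^2\,d\lambda.$$

Using once more Lemma 1, we get

$$l_T(h) \sim \Delta_T(1, h, \varphi_f) + 4\sqrt{T}\,\mathrm{Im}\int_{\mathbb{R}} (h^2(\lambda) - 2h(\lambda))\lambda\varphi_f(-\lambda)\zeta_T(\lambda)\,d\lambda$$
$$\qquad + 4\int_{\mathbb{R}} (h^2(\lambda) - 2h(\lambda))(|\zeta_T(\lambda)|^2 - 1)\,d\lambda - T\|\varphi_{f'}\|^2,$$

uniformly in $h \in H_T$. This relation yields

(14) $\mathbf{E}_S[l_T(h_T)] \sim \mathbf{E}_S[\Delta_T(1, h_T, \varphi_f)] - T\|\varphi_{f'}\|^2 + \mathrm{Im}\,A_3 + A_4,$

with the notation $A_3 = 4\sqrt{T}\,\mathbf{E}_S\int_{\mathbb{R}} (h_T^2(\lambda) - 2h_T(\lambda))\lambda\varphi_f(-\lambda)\zeta_T(\lambda)\,d\lambda$ and

$$A_4 = 4\mathbf{E}_S\int_{\mathbb{R}} (h_T^2(\lambda) - 2h_T(\lambda))(|\zeta_T(\lambda)|^2 - 1)\,d\lambda.$$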

We wish to show that the terms $A_1$-$A_4$ are asymptotically smaller than
$\mathbf{E}_S[\Delta_T(1, h_T, \varphi_f)]$ when $T$ tends to infinity. This can be done using the two
following lemmas, the proofs of which are deferred to Section 5.3.

Lemma 2. Let $h(\cdot,\omega)$ be a bounded random function which takes only
$N$ different values $h_1,\ldots,h_N$. Then $\mathbf{E}_S|\int_{\mathbb{R}} h(\lambda)\zeta_T(\lambda)\,d\lambda| \le C\sqrt{N\mathbf{E}_S\|h\|^2}$,
where the constant $C$ depends only on $k$, $R$, $S_0$.

As a consequence of this lemma, we obtain $A_2 \vee A_3 \le C\sqrt{N\,\mathbf{E}_S[\Delta_T(1, h_T, \varphi_f)]}$.

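Lemma 3. For any random function $h(\cdot,\omega)$ taking only $N$ different values
$h_1(\cdot),\ldots,h_N(\cdot)$ such that $\|h_i\|^2 \le T$, the following inequality holds:

$$\sup_{S\in\Sigma_\delta}\mathbf{E}_S\bigg|\int_{\mathbb{R}} h(\lambda)(|\zeta_T(\lambda)|^2 - 1)\,d\lambda\bigg| \le C\sqrt{N}\,\varepsilon_T\sqrt{\mathbf{E}_S\|h\|^2},$$

where $\varepsilon_T = T^{1/\sqrt{\log T}}$ and the constant $C$ depends only on $k$, $R$, $S_0$.
As a consequence of this lemma, we obtain the inequality

$$\mathbf{E}_S\bigg|\int_{\mathbb{R}} h_T^n(\lambda)(|\zeta_T(\lambda)|^2 - 1)\,d\lambda\bigg| \le C\sqrt{N}\,\varepsilon_T\sqrt{\mathbf{E}_S\|h_T\|^2} \le C\varepsilon_T\sqrt{\mathbf{E}_S[\Delta_T(1, h_T, \varphi_f)]},$$

for any integer $n > 0$. This inequality implies that $A_1 \le C\varepsilon_T\sqrt{\mathbf{E}_S[\Delta_T(1, h_T, \varphi_f)]}$
and $A_4 \le C\varepsilon_T\sqrt{\mathbf{E}_S[\Delta_T(1, h_T, \varphi_f)]}$. Now (13) and (14) can be rewritten in the
form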

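$$r_T(\hat f_T^{(1)}, f_S') \le \frac{1}{2\pi T}\Big(\sqrt{\mathbf{E}_S[\Delta_T(1, h_T, \varphi_f)]} + C\varepsilon_T\Big)^2,$$

$$\mathbf{E}_S[l_T(h_T^*)] \le \Big(\sqrt{\mathbf{E}_S[\Delta_T(1, h_T^*, \varphi_f)]} + C\varepsilon_T\Big)^2 - T\|\varphi_{f'}\|^2,$$

$$\mathbf{E}_S[l_T(h_T)] \ge \Big(\sqrt{\mathbf{E}_S[\Delta_T(1, h_T, \varphi_f)]} - C\varepsilon_T\Big)^2 - T\|\varphi_{f'}\|^2.$$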

Here /i^(A) = (1 — |a^A| fc+PT )+. Taking into account the fact that hx mini- 
mizes the functional It(-) over , we get Es[/t(/it)] < m ^ n heH N ^s[^T(h)]. 
On the other hand, by arguments very similar to those of Lemma 5 in [4], one 
checks easily that min h <- n N Es^^/i)] ~ min^g^y F,s[lr(h)] < Es[Zj-(/i;^)]. 
Combining all these estimates, we arrive at the inequality 

$$r_T(\hat f_T^{(1)}, f_S') \le \frac{1}{2\pi T}\Big(\sqrt{\Delta_T(1, h_T^*, \varphi_f)} + C\varepsilon_T\Big)^2(1 + o_T(1)).$$



Therefore, the expression T k /( 2k+1 )y/B T (f!pJ' s ) is asymptotically bounded 
by v /r- 1 /(2fc+i)A T (l,/i^,^ / ), plus a residual term CT 1 /v^- 1 /(4fc+2) _ It 
is well known in the theory of minimax estimation that the supremum of
the quantity $T^{-1/(2k+1)}\Delta_T(1, h_T^*, \varphi_f)$ over the Sobolev ball
$\Sigma(k+1, 4R^2) = \{f : \|f^{(k+1)} - f_0^{(k+1)}\|^2 \le 4R^2\}$ tends to the constant $4P(k, R)$.
This completes the proof of the theorem, due to the inclusion
$\Sigma_\delta(k, R) \subset \Sigma(k+1, 4R^2 + o_\delta(1))$ (see the proof of Theorem 5 in [6]).

5.3. Proofs of technical lemmas. 

Proof of Lemma 1. First of all, note that $\mathbf{E}_S[\hat\varphi_T(\lambda)] = \varphi_{f_S}(\lambda)$. Now,
taking into account the occupation times formula ([33], page 224) and the
martingale representation of the local time estimator ([26], page 137), we obtain

$$\hat\varphi_T(\lambda) - \mathbf{E}_S[\hat\varphi_T(\lambda)] = T^{-1}\big(H_S(\lambda, X_T) - H_S(\lambda, X_0)\big) - T^{-1}\int_0^T g_S(\lambda, X_t)\,dW_t,$$

where the functions $H_S$ and $g_S$ are defined as

$$g_S(\lambda, u) = 2\int_{\mathbb{R}} e^{i\lambda x} f_S(x)\,\frac{\mathbf{1}_{\{u>x\}} - F_S(u)}{f_S(u)}\,dx,$$

$$H_S(\lambda, u) = 2\int_{\mathbb{R}} e^{i\lambda x} f_S(x)\bigg[\int_0^u \frac{\mathbf{1}_{\{v>x\}} - F_S(v)}{f_S(v)}\,dv\bigg]dx.$$

The integration by parts formula yields

$$i\lambda g_S(\lambda, u) = 2e^{i\lambda u} - \tilde g_S(\lambda, u),$$

where

$$\tilde g_S(\lambda, u) = 2\int_{\mathbb{R}} e^{i\lambda x} f_S'(x)\,\frac{\mathbf{1}_{\{u>x\}} - F_S(u)}{f_S(u)}\,dx.$$



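It implies that $\lambda(\hat\varphi_T(\lambda) - \varphi_f(\lambda)) = T^{-1}\,2i\int_0^T e^{i\lambda X_t}\,dW_t + T^{-1/2} m_S(\lambda, X^T)$
with $\sqrt{T}\,m_S(\lambda, X^T) = \lambda(H_S(\lambda, X_T) - H_S(\lambda, X_0)) - i\int_0^T \tilde g_S(\lambda, X_t)\,dW_t$. In
the same way one can prove that $|i\lambda H_S(\lambda, u)| = |\int_0^u i\lambda g_S(\lambda, v)\,dv| \le 2|u| + |\int_0^u \tilde g_S(\lambda, v)\,dv|$.
Using the Plancherel identity and Lemma 4 from [6], we get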

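$$\int_{\mathbb{R}} \mathbf{E}_S|\tilde g_S(\lambda, \xi)|^2\,d\lambda = 8\pi\int_{\mathbb{R}} f_S'(x)^2\,\mathbf{E}_S\bigg[\frac{\mathbf{1}_{\{\xi>x\}} - F_S(\xi)}{f_S(\xi)}\bigg]^2 dx \le C,$$

where $C$ is a constant independent of $S \in \Sigma_\delta$. Similarly, one checks that
$\int_{\mathbb{R}} \mathbf{E}_S|\int_0^\xi \tilde g_S(\lambda, v)\,dv|^2\,d\lambda \le C$. Thus, we have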

^ T Es|m s (A,X T )| 2 ( iA<^ T (^|E5|iA J F/5(A,0| 2 + 2E s | 5 5(A,e)| 2 )dA 

<16E 5 [£ 2 ] + 16 [ E s [ g s (X,v)dv 

JR JO 

+ 2 f B s \gs(X,0\ 2 dX<C, 

JR 

and the assertion of the lemma follows. □ 

Proof of Lemma 2. Let us denote & = / R h( T - It is evident that 

h(X)( T {X)dX =Ec 



E.« 



\Z h \) < jEs\\h\\*V 3 \Z h \ 



< 



N 



B s \\h\\ 2 J2Vs\Z hl \ 



i=l 



1/2 



Now, taking into account the explicit form of £r, we have 



^s\&h\ 



1 



rEc 



e iXXt hi(X) dXdWt 
1



< 



Nl 2 
c 

Jh~¥ 



\h; 



^E 5 |^(e)i 2 



\Vhi{x)\ dx 



where $C = \sup_{S\in\Sigma_\delta}\sup_{x\in\mathbb{R}} f_S(x)$. This completes the proof of Lemma 2. □

Proof of Lemma 3. The Itô formula implies that, for any continuous
martingale $M_t$, we have $M_T^2 - M_0^2 = 2\int_0^T M_t\,dM_t + \langle M\rangle_T$, where $\langle M\rangle_t$
is the quadratic variation of the martingale $M_t$. Applying this formula,
we get $T|\zeta_T(\lambda)|^2 = 2\int_0^T Y_t(\lambda)\,dW_t + T$, where we have used the notation
$Y_t(\lambda) = \mathrm{Re}\,e^{i\lambda X_t}\int_0^t e^{-i\lambda X_u}\,dW_u = \mathrm{Re}\,e^{i\lambda X_t}\sqrt{t}\,\overline{\zeta_t(\lambda)}$. Changing the order of the
integrals and using the Itô isometry, we get

2 



U S (h) := E £ 



h(X)(\( T (X)\'-l)dX 



4 



Y t (X)h(X) dX 



h(X)Y t (X)dX 



dWt 



dt 



( t (X)h(X)dX 



dt. 
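The Itô-formula step $T|\zeta_T(\lambda)|^2 = 2\int_0^T Y_t(\lambda)\,dW_t + T$ has an exact discrete-time counterpart, which the following Python sketch verifies along one simulated path. Everything specific here is an assumption made for illustration: the drift $S(x) = -x$, the Euler scheme, the step size, the frequency `lam` and the function name. In discrete time the quadratic variation term is $\sum (\Delta W)^2$, which only approximates $T$.

```python
import cmath
import math
import random


def discrete_ito_check(num_steps=2000, horizon=5.0, lam=1.3, seed=0):
    """Return (|M_n|^2, 2 * sum(Y dW) + sum(dW^2)) along one simulated path."""
    rng = random.Random(seed)
    dt = horizon / num_steps
    x = 0.0                # hypothetical starting point of the path
    martingale = 0j        # M_j = sum over l <= j of exp(i * lam * X_{l-1}) * dW_l
    stochastic_sum = 0.0   # discrete analogue of the integral of Y_t(lam) dW_t
    quadratic_variation = 0.0
    for _ in range(num_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))
        phase = cmath.exp(1j * lam * x)
        # Discrete Y_t(lam) = Re( exp(i lam X_t) * conj(M_t) ).
        y = (phase * martingale.conjugate()).real
        stochastic_sum += y * dw
        quadratic_variation += dw * dw
        martingale += phase * dw
        x += -x * dt + dw  # Euler step for the assumed drift S(x) = -x
    return abs(martingale) ** 2, 2.0 * stochastic_sum + quadratic_variation


lhs, rhs = discrete_ito_check()
print(lhs, rhs)  # the two numbers agree up to floating-point rounding
```

The agreement is algebraic, since $|M + d|^2 = |M|^2 + 2\,\mathrm{Re}(\bar M d) + |d|^2$ at every step, so it holds for any path and any seed.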



We apply now the same method as in the proof of the first lemma. Let us in- 
troduce £h = / K /i(A)(|Cr(A)| 2 — 1) dX. The Cauchy-Schwarz inequality 
yields 



Ec 



/ 1 (A)(|Ct(A)P 



l)dA 



A' 



i=l 



It remains to carry out the inequality Eg|^.| 2 = ||/i.j||~ 2 Us (hi) < C, where 
C is a constant depending only on k, R, So. The proof of this inequality will 
be divided into three steps. 

First step. Suppose that $T > \kappa$, where $\kappa$ is the positive number defined
by condition C3. Then we have the inequalities



Ec 



tE s 



jkXt 



iXXi 



Ct(X)hi(X)dX 



dt<K z [ / hi(X)dX) <AK z a 



< i2K 2 r||/ li || 2 , 



hi(X) 



e i\x u dWudX 



t — K 



2 


ra . 1 






<E 5 


[/J, 


f e iXXu dW u 


dX 




J t — K 
2


J —a . 


f e lXXu dW u 


dX 


J t — K 





These inequalities imply that 



u s {K 

\\h,P 



<C + 
= C + 



INI 2 ^ 2 jk 

INI 2 ^ 2 ' 



< Aa~ 2 k < 12KT\\hi\\ 2 
e iXXt hi(X) 



t — K 



e iXXu dW u dX 



o 



dt 



We want to prove now that $U_S(h_i) \le C\varepsilon_T T^2\|h_i\|^2$ for a constant $C$. Recall
that $X^u$ denotes the trajectory of $X$ between $0$ and $u$. Since the random
variable $\eta(X^{t-\kappa}, \lambda) = \int_0^{t-\kappa} e^{i\lambda X_u}\,dW_u$ is $\mathcal{F}_{t-\kappa}$-measurable and the law
$\mathcal{L}(X_t \mid \mathcal{F}_{t-\kappa}) = \mathcal{L}(X_t \mid X_{t-\kappa})$, we have



Us {hi 



E. 



o iX Vr 



r 1 {X t - K 1 X)h i {X)dX 



p K (S,X t _ K ,y)dy 



dt. 



Let us denote $Q(X^{t-\kappa}, y) = |\int_{\mathbb{R}} e^{i\lambda y}\eta(X^{t-\kappa}, \lambda)h_i(\lambda)\,d\lambda|$.

Second step. To simplify the notation we suppose that $\kappa = 1$. Let us prove
now that, for any function $Q(\cdot)$, the following inequality holds:

(15) $\int_{\mathbb{R}} Q^2(y)p_1(S, x, y)\,dy \le \varepsilon_T\sup_{y\in\mathbb{R}} p_1(S_0, x, y)\int_{\mathbb{R}} Q^2(y)\,dy + T^{-2}\sqrt{\int_{\mathbb{R}} Q^4(y)p_1(S, x, y)\,dy},$

for any $S \in V_\delta(S_0)$ with $\delta < 0.2$. Indeed, if we denote by $\mathbf{E}_S^x$ the mathematical
expectation with respect to the measure induced by the solution
of (1) with deterministic initial value $X_0 = x$, then $\int_{\mathbb{R}} Q^2(y)p_1(S, x, y)\,dy =
\mathbf{E}_S^x[Q^2(X_1)] = \mathbf{E}_{S_0}^x[Q^2(X_1)L(S, X^1)]$, where

$$L(S, X^1) = \exp\bigg\{\int_0^1 (S(X_u) - S_0(X_u))\,dW_u - \frac{1}{2}\int_0^1 (S(X_u) - S_0(X_u))^2\,du\bigg\}$$

is the likelihood ratio and $X^1 = (X_t, 0 \le t \le 1)$ denotes the trajectory of the
process $X$ up to the time 1. Further, using the Cauchy-Schwarz inequality,
we get

$$\mathbf{E}_S^x[Q^2(X_1)] \le \varepsilon_T\mathbf{E}_{S_0}^x[Q^2(X_1)] + \sqrt{\mathbf{E}_S^x[Q^4(X_1)]}\sqrt{\mathbf{P}_S^x(L(S, X^1) > \varepsilon_T)}.$$

Note now that, for any $n > 0$, we have

$$L(S, X^1)^n = \exp\bigg\{M_1 - \frac{1}{2}\langle M\rangle_1 + \frac{n^2 - n}{2}\int_0^1 (S(X_u) - S_0(X_u))^2\,du\bigg\} \le e^{M_1 - (1/2)\langle M\rangle_1}\,e^{n^2\delta^2},$$

where $M_t = n\int_0^t (S - S_0)(X_u)\,dW_u$ is a local martingale. Therefore, taking
$n = 5\sqrt{\log T}$ and applying the Chebyshev inequality, we get $\mathbf{P}_S(L(S, X^1) >
\varepsilon_T) \le e^{-5\log T + 25\delta^2\log T} \le T^{-4}$, which leads to (15).
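The arithmetic behind the final Chebyshev bound can be checked numerically. The sketch below assumes $\varepsilon_T = T^{1/\sqrt{\log T}} = e^{\sqrt{\log T}}$, $n = 5\sqrt{\log T}$ and the moment bound $e^{n^2\delta^2}$ for the $n$-th power of the likelihood ratio; the function name and the sample values of $T$ and $\delta$ are hypothetical. Under these assumptions $\varepsilon_T^{-n}e^{n^2\delta^2} = T^{-5+25\delta^2}$, which does not exceed $T^{-4}$ as soon as $\delta \le 0.2$.

```python
import math


def chebyshev_bound(horizon: float, delta: float) -> float:
    """Value of eps_T**(-n) * exp(n**2 * delta**2) with n = 5 * sqrt(log T)."""
    log_t = math.log(horizon)
    eps_t = math.exp(math.sqrt(log_t))   # assumed eps_T = T**(1 / sqrt(log T))
    n = 5.0 * math.sqrt(log_t)           # assumed moment order
    return eps_t ** (-n) * math.exp(n * n * delta * delta)


for horizon in (10.0, 100.0, 1000.0):
    for delta in (0.05, 0.1, 0.2):
        # Compare the bound with the target rate T**(-4).
        print(horizon, delta, chebyshev_bound(horizon, delta), horizon ** -4)
```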

Third step. Inequality (15) yields U S {h) < ffiD^t) + T~ 2 V 'D 2 (t)) dt, 
where we have used the abbreviations 



D 1 (t) = e T B s 



sap Pl {So,Xt-i,y) [ f e iXy h i (X)r](X t ~ 1 , X) dX 

y£R JR JR 



2 



dy 



and D 2 (t) = -Esf R Q 4 (X t -\y) Pl (S,X t - 1 ,y)dy = E s [Q\X t - 1 ,X t )]. The term 
D\ can be evaluated via the Plancherel identity and the Holder inequality: 



Di(t) = 2s t ttEs 



supp 1 (S ,X t _ 1 ,y) [ \h i (\)ri(X t - 1 ,\)\ 2 dX 
y JR 



<C q e T (t-l)(E s 



suppi(5 ,A" ,y) 9 
y 



hi\\\ 



in view of the estimate $\mathbf{E}[|\eta(X^{t-1}, \lambda)|^{2s}] \le C_s(t - 1)^s$, for any $s > 0$, which
follows from the BDG inequality ([33], Theorem IV.4.1). Using condition
C3 and the [uniform in $S \in V_\delta(S_0)$] boundedness of $f_S/f_{S_0}$, we get $D_1(t) \le
C\varepsilon_T(t - 1)\|h_i\|^2$.

One easily checks that $|Q(X^{t-1}, X_t)|^2 \le \int_{\alpha_i|\lambda|\le 1} |\eta(X^{t-1}, \lambda)|^2\,d\lambda\,\|h_i\|^2$ and,
hence, $D_2(t) \le Ct^2\|h_i\|^8 \le Ct^2T^2\|h_i\|^4$. Combining these estimates, one obtains
$U_S(h_i) \le C\varepsilon_T T^2\|h_i\|^2$, which completes the proof. □